These are the notes for two lectures delivered at the Les Houches summer school Mathematical
Statistical Mechanics, held in July 2005.

I review some basic notions on sparse graph error correcting codes with emphasis on 
'modern' aspects, such as iterative belief propagation decoding. Relations with statistical
mechanics, inference and random combinatorial optimization are stressed, as well as some 
general mathematical ideas and open problems. 

I. INTRODUCTION 

Imagine entering the auditorium and reading the following (partially erased) phrase on the
blackboard:

TH* L*CTU*E OF *********** WA* EX**EMELY B*RING.

You will probably be able to reconstruct most of the words in the phrase despite the erasures.

The reason is that the English language is redundant. One can roughly quantify this redundancy as
follows. The English dictionary contains about $10^6$ words, including technical and scientific terms.
On the other hand, the average length of these words is about 8.8 letters^. A conservative estimate
of the number of 'potential' English words is therefore $26^{8.8} \approx 2 \cdot 10^{12}$. Only a tiny fraction
(about $10^{-6}$) of these possibilities is realized. This is of course a waste from the point of view of
information storage, but it makes words robust against errors (such as the above erasures). Of course,
they are not infinitely robust: above some threshold the information is completely blurred out by
noise (as in the case of the name of the speaker in our example).

* UMR 8549, Unité Mixte de Recherche du Centre National de la Recherche Scientifique et de l'École Normale
Supérieure.

^ This estimate was obtained by the author by averaging over 40 words randomly generated by the site
http://www.wordbrowser.net/wb/wb.html

FIG. 1: A (too) naive model for the English language.

FIG. 2: Example of a Tanner graph.

A very naive model for the redundancy of English could be the following. In order for a word to
be easily pronounced, it must contain some alternation of vowels and consonants. Let us be rough
and establish that an English word is a sequence of 8 letters that contains neither two consecutive
vowels nor two consecutive consonants. This yields $2 \cdot 21^4 \cdot 5^4 \approx 2.4 \cdot 10^8$ distinct words, which
overestimates the correct number 'only' by a factor 240. A graphical representation of this model is
reproduced in Fig. 1.
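The counting estimates above are easy to reproduce. The following is a minimal sketch that uses only the figures quoted in the text (about $10^6$ dictionary words, average length 8.8, 5 vowels and 21 consonants); the variable names are illustrative, and the printed values should be read as order-of-magnitude estimates, as in the text.

```python
# Rough redundancy estimates for English, using the figures quoted in the text.
dictionary_size = 1e6      # about 10^6 dictionary words
mean_length = 8.8          # average word length in letters

# Number of letter strings of (non-integer) average length 8.8.
potential_words = 26 ** mean_length
used_fraction = dictionary_size / potential_words

# Naive model: 8 letters strictly alternating between the 5 vowels and the
# 21 consonants; the factor 2 accounts for which class comes first.
naive_words = 2 * 21**4 * 5**4
overestimate = naive_words / dictionary_size

print(f"potential words ~ {potential_words:.1e}")
print(f"used fraction   ~ {used_fraction:.1e}")
print(f"naive model     = {naive_words}")
print(f"overestimate    ~ {overestimate:.0f}")
```

The exact value of $2 \cdot 21^4 \cdot 5^4$ is 243101250, which the text rounds to $2.4 \cdot 10^8$.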

The aim of coding theory is to construct an optimal 'artificial dictionary' allowing for reliable 
communication through unreliable media. It is worth introducing some jargon of this discipline. 
Words of natural languages correspond to codewords in coding. Their length (which is often 
considered as fixed) is called the blocklength: we shall denote it by N throughout these lectures. 
The dictionary (i.e. the set of all words used for communication) is called the codebook and denoted
by $C$. As in our example, the dictionary size is usually exponential in the blocklength, $|C| = 2^{NR}$,
and $R$ is called the code rate. Finally, the communication medium is referred to as the channel and
is usually modeled in a probabilistic way (we shall see below a couple of examples).

II. CODES ON GRAPHS 

We shall now construct a family of such 'artificial dictionaries' (codes). For the sake of simplicity,
codewords will be formed over the binary alphabet $\{0, 1\}$. Therefore a codeword $\underline{x} \in C$ will be an
element of the Hamming space $\{0, 1\}^N$ or, equivalently, a vector of coordinates $(x_1, x_2, \dots, x_N) = \underline{x}$.

The codebook $C$ is a subset of $\{0, 1\}^N$. Inspired by our simple model of English, we shall define
$C$ by stipulating that $\underline{x}$ is a codeword if and only if a certain number $M$ of constraints on the bits
$x_1, \dots, x_N$ are met. In order to specify these constraints, we will draw a bipartite graph (Tanner
graph) over vertex sets $[N]$ and $[M]$. Vertices in these sets will be called, respectively, variable
nodes (denoted as $i, j, \dots$) and check nodes ($a, b, \dots$). If we denote by $i^a_1, i^a_2, \dots, i^a_k$ the variable
nodes adjacent to check node $a$ in the graph, then $x_{i^a_1}, x_{i^a_2}, \dots, x_{i^a_k}$ must satisfy some constraint in
order for $\underline{x}$ to be a codeword. An example of a Tanner graph is depicted in Fig. 2.

Which type of constraints are we going to enforce on the symbols $x_{i^a_1}, \dots, x_{i^a_k}$ adjacent to
the same check? The simplest and most widespread choice is a simple parity check condition:
$x_{i^a_1} \oplus x_{i^a_2} \oplus \cdots \oplus x_{i^a_k} = 0$ (where $\oplus$ is my notation for sum modulo 2). We will stick to this choice,
although several of the ideas presented below are easily generalized. Notice that, since the parity
check constraint is linear in $\underline{x}$, the code $C$ is a linear subspace of $\{0, 1\}^N$, of size $|C| \ge 2^{N-M}$
(and in fact $|C| = 2^{N-M}$ unless redundant constraints are used in the code definition). For general
information theoretic reasons one is particularly interested in the limit of large blocklength $N \to \infty$
at fixed rate. This implies that the number of checks per variable is kept fixed: $M/N = 1 - R$.
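The statement $|C| \ge 2^{N-M}$, with equality unless some checks are redundant, can be checked on a toy example. The sketch below is only an illustration: the parity-check matrix `H` is a hypothetical $N = 6$, $M = 4$ example (not taken from the lectures) whose third row is the mod-2 sum of the first two, and `gf2_rank` and `codewords` are helper names introduced here.

```python
from itertools import product

def gf2_rank(rows):
    """Rank over GF(2) of a matrix given as a list of 0/1 lists."""
    rows = [r[:] for r in rows]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Look for a pivot in this column among the rows not yet used.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def codewords(H):
    """Brute-force list of all x in {0,1}^N satisfying every parity check."""
    n = len(H[0])
    return [x for x in product((0, 1), repeat=n)
            if all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H)]

# Hypothetical toy code: the third check is redundant (row 3 = row 1 + row 2).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
N, M = len(H[0]), len(H)
rank = gf2_rank(H)
size = len(codewords(H))
print(f"rank = {rank}, |C| = {size}, 2^(N-M) = {2 ** (N - M)}")
```

Here the codebook size equals $2^{N - \mathrm{rank}(H)}$, which exceeds $2^{N-M}$ precisely because one constraint is redundant.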

Once the general code structure is specified, it is useful to define a set of parameters which 
characterize the code. Eventually, these parameters can be optimized over to obtain better error 
correction performances. A simple set of such parameters is the degree profile $(\Lambda, P)$ of the code.
Here $\Lambda = (\Lambda_0, \dots, \Lambda_{l_{\max}})$, where $\Lambda_l$ is the fraction of variable nodes of degree $l$ in the Tanner graph.
Analogously, $P = (P_0, \dots, P_{k_{\max}})$, where $P_k$ is the fraction of check nodes of degree $k$.

Given the degree profile, there is of course a large number of graphs having the same profile.
How should one choose among them? In his seminal 1948 paper, Shannon first introduced the idea
of randomly constructed codes. We shall follow his intuition here and assume that the Tanner
graph defining $C$ is generated uniformly at random among the ones with degree profile $(\Lambda, P)$ and
blocklength $N$. The corresponding code (graph) ensemble is denoted as $\mathrm{LDPC}_N(\Lambda, P)$ (respectively
$G_N(\Lambda, P)$), where LDPC is an acronym for low-density parity-check codes. Generically, one can prove
that some measure of the code performance concentrates in probability with respect to the choice of the
code in the ensemble. Therefore, a random code is likely to be (almost) as good as (almost) any other
one in the ensemble.
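As an illustration of how such a random Tanner graph can be drawn in practice, here is a minimal sketch for the regular case (all variable nodes of degree `l`, all check nodes of degree `k`). It uses the standard 'socket matching' construction, which is an assumption of this sketch rather than the construction specified in the lectures: it samples multigraphs, so repeated edges can occasionally occur. The function name and parameters are hypothetical.

```python
import random

def sample_regular_tanner_graph(n_var, l, k, seed=0):
    """Return (edges, n_chk) for a random (l, k)-regular Tanner multigraph.

    Each variable node gets l sockets, each check node k sockets, and the
    two socket lists are matched through a uniformly random permutation.
    """
    if (n_var * l) % k != 0:
        raise ValueError("n_var * l must be divisible by k")
    n_chk = n_var * l // k
    rng = random.Random(seed)
    var_sockets = [i for i in range(n_var) for _ in range(l)]
    chk_sockets = [a for a in range(n_chk) for _ in range(k)]
    rng.shuffle(chk_sockets)
    return list(zip(var_sockets, chk_sockets)), n_chk

n_var = 20
edges, n_chk = sample_regular_tanner_graph(n_var, l=3, k=6, seed=1)

# Check the degree profile and the design rate R = 1 - M/N.
var_degree = [0] * n_var
chk_degree = [0] * n_chk
for i, a in edges:
    var_degree[i] += 1
    chk_degree[a] += 1
design_rate = 1 - n_chk / n_var
print(n_chk, design_rate, set(var_degree), set(chk_degree))
```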

A particular property of the random graph ensemble will be useful in the following. Let
$G$ be drawn from $G_N(\Lambda, P)$ and let $i$ be a uniformly random variable node in $G$. Then, with high
probability (i.e. with probability approaching one in the large blocklength limit), the shortest loop in
$G$ through $i$ is of length $\Theta(\log N)$.
III. A SIMPLE-MINDED BOUND AND BELIEF PROPAGATION

A. Characterizing the code performances



Once the code is constructed, we have to convince ourselves (or somebody else) that it is going 
to perform well. The first step is therefore to produce a model for the communication process, i.e. 
a noisy channel. A simple such model (usually referred to as the binary symmetric channel, or
BSC($p$)) consists in saying that each bit $x_i$ is flipped independently of the others with probability
$p$. In other words, the channel output $y_i$ is equal to $x_i$ with probability $1 - p$ and different with
probability $p$. This description can be encoded in a transition probability kernel $Q(y|x)$. For
BSC($p$) we have $Q(0|0) = Q(1|1) = 1 - p$ and $Q(1|0) = Q(0|1) = p$. More generally, we shall
consider transition probabilities satisfying the 'symmetry condition' $Q(y|0) = Q(-y|1)$ (in the
BSC case, this condition is fulfilled if we use the $+1$, $-1$ notation for the channel output).
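The channel model can be made concrete with a few lines of code. The following is a small sketch, with hypothetical helper names, that stores the kernel $Q(y|x)$ of BSC($p$) as a dictionary, simulates the transmission of the all-zero word, and checks the symmetry condition (with outputs in $\{0, 1\}$ the map $y \to -y$ of the $\pm 1$ notation becomes $y \to 1 - y$).

```python
import random

def bsc_kernel(p):
    """Transition probabilities Q(y|x) of BSC(p); keys are (y, x) pairs."""
    return {(0, 0): 1 - p, (1, 1): 1 - p, (1, 0): p, (0, 1): p}

def transmit(x, p, rng):
    """Flip each bit of x independently with probability p."""
    return [xi ^ (rng.random() < p) for xi in x]

p = 0.1
Q = bsc_kernel(p)

# Send the all-zero word and measure the empirical fraction of flipped bits.
rng = random.Random(0)
x = [0] * 10000
y = transmit(x, p, rng)
flip_rate = sum(y) / len(y)

# Symmetry condition: exchanging the two outputs exchanges the two inputs.
symmetric = all(Q[(out, 0)] == Q[(1 - out, 1)] for out in (0, 1))
print(f"empirical flip rate = {flip_rate:.3f}, symmetric = {symmetric}")
```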

The next step consists in establishing a measure of the performances of our code. There are 
several natural such measures, for instance the expected number of incorrect symbols, or of incorrect 
words. To simplify the arguments below, it is convenient to consider a slightly less natural measure, 
which conveys essentially the same information. Recall that, given a discrete random variable $X$
with distribution $\{p(x) : x \in \mathcal{X}\}$, its entropy, defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) , \tag{1}$$

is a measure of how 'uncertain' $X$ is. Analogously, if $X$, $Y$ are two random variables with joint
distribution $\{p(x, y) : x \in \mathcal{X}, y \in \mathcal{Y}\}$, the conditional entropy

$$H(X|Y) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x|y) , \tag{2}$$

is a measure of how 'uncertain' $X$ is once $Y$ is given.

Now consider a uniformly random codeword $\underline{X}$ and the corresponding channel output $\underline{Y}$ (as
produced by the binary symmetric channel). The conditional entropy $H(\underline{X}|\underline{Y})$ measures how many
additional bits of information (beyond the channel output) we need in order to reconstruct $\underline{x}$ from $\underline{y}$.
This is a fundamental quantity, but it is sometimes difficult to evaluate because of its non-local nature.
We shall therefore also consider the bitwise conditional entropy

$$h_b = \frac{1}{N} \sum_{i=1}^{N} H(X_i|\underline{Y}) . \tag{3}$$
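As a quick sanity check of the two entropy definitions (1) and (2), the sketch below evaluates them for a single uniformly random bit sent through BSC($p$), where the conditional entropy must coincide with the binary entropy of $p$. Logarithms are taken in base 2 here (an assumption of this sketch, so that entropies are measured in bits), and the function names are illustrative.

```python
from math import log2

def entropy(p_x):
    """H(X) = -sum_x p(x) log p(x), in bits; p_x maps x -> p(x)."""
    return -sum(p * log2(p) for p in p_x.values() if p > 0)

def conditional_entropy(p_xy):
    """H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), in bits; p_xy maps (x, y) -> p(x,y)."""
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

# One uniformly random bit X sent through BSC(p), with output Y.
p = 0.11
joint = {(x, y): 0.5 * ((1 - p) if x == y else p)
         for x in (0, 1) for y in (0, 1)}

h_x = entropy({0: 0.5, 1: 0.5})
h_x_given_y = conditional_entropy(joint)
binary_entropy = -p * log2(p) - (1 - p) * log2(1 - p)
print(h_x, h_x_given_y, binary_entropy)
```

Observing the channel output reduces the uncertainty on the bit from one bit to the binary entropy of $p$.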
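B. Bounding the conditional entropy

Before trying to estimate these quantities, it is convenient to use the channel and code symmetry
in order to simplify the task. Consider for instance the conditional entropy. Denoting by $\underline{0}$ the 'all
zero' codeword, we have

$$H(\underline{X}|\underline{Y}) = -\sum_{\underline{x}, \underline{y}} p(\underline{x})\, p(\underline{y}|\underline{x}) \log p(\underline{x}|\underline{y}) = \tag{4}$$
$$= -\sum_{\underline{y}} p(\underline{y}|\underline{0}) \log p(\underline{0}|\underline{y}) = \tag{5}$$
$$= -E_y \log p(\underline{0}|\underline{y}) , \tag{6}$$

where $E_y$ denotes expectation with respect to the probability measure $p(\underline{y}|\underline{0}) = \prod_i Q(y_i|0)$. In the
BSC($p$) case, under this measure, the $y_i$ are i.i.d. Bernoulli random variables with parameter $p$.
Furthermore, by using Bayes' theorem, we get

$$H(\underline{X}|\underline{Y}) = -E_y \log p(\underline{y}|\underline{0}) + E_y \log \Big\{ \sum_{\underline{x} \in C} p(\underline{y}|\underline{x}) \Big\} = \tag{7}$$
$$= -N \sum_{y} Q(y|0) \log Q(y|0) + E_y \log \Big\{ \sum_{\underline{x}} \prod_{i=1}^{N} Q(y_i|x_i) \prod_{a=1}^{M} I(x_{i^a_1} \oplus \cdots \oplus x_{i^a_k} = 0) \Big\} . \tag{8}$$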

The first term is easy to evaluate, consisting of a finite sum (or, at most, a finite-dimensional
integral). The second one can be identified as the quenched free energy for a disordered model
with binary variables (Ising spins) associated to the vertices of the Tanner graph $G$. Proceeding
as in Eq. (6), one also gets the following expression for the single bit conditional entropy

$$H(X_i|\underline{Y}) = -E_y \log p(x_i = 0|\underline{y}) . \tag{9}$$

A simple idea for bounding a conditional entropy is to use the 'data processing inequality'. This
says that, if $X \to Y \to Z$ is a Markov chain, then $H(X|Y) \le H(X|Z)$. Let $B(i, r)$ denote the
subgraph of $G$ whose variable nodes lie at a distance at most $r$ from $i$ (with the convention that a
check node $a$ belongs to $B(i, r)$ only if all of the adjacent variable nodes belong to $B(i, r)$). Denote by
$\underline{Y}_{i,r}$ the vector of output symbols $Y_j$ such that $j \in B(i, r)$. The data processing inequality implies

$$H(X_i|\underline{Y}) \le H(X_i|\underline{Y}_{i,r}) = -E_y \log p(x_i = 0|\underline{y}_{i,r}) . \tag{10}$$

A little more work shows that this inequality still holds if $p(x_i|\underline{y}_{i,r})$ is computed as if there were no
parity checks outside $B(i, r)$. In formulae, we can substitute $p(x_i = 0|\underline{y}_{i,r})$ with

$$p_{i,r}(x_i = 0|\underline{y}_{i,r}) = \sum_{\underline{x}_{i,r} :\, x_i = 0} p_{i,r}(\underline{x}_{i,r}|\underline{y}_{i,r}) , \tag{11}$$
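FIG. 3: Radius 1 neighborhood of a typical site $i$ in the Tanner graph.

where

$$p_{i,r}(\underline{x}_{i,r}|\underline{y}_{i,r}) = \frac{1}{Z_{i,r}(\underline{y}_{i,r})} \prod_{j \in B(i,r)} Q(y_j|x_j) \prod_{a \in B(i,r)} I(x_{i^a_1} \oplus \cdots \oplus x_{i^a_k} = 0) , \tag{12}$$

and $Z_{i,r}(\underline{y}_{i,r})$ ensures the correct normalization of $p_{i,r}(\underline{x}_{i,r}|\underline{y}_{i,r})$.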

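We are left with the task of computing $p_{i,r}(x_i = 0|\underline{y}_{i,r})$. As a warmup exercise, let us consider
the case $r = 1$. Without loss of generality, we set $i = 0$. Because of the remark made in the previous
Section, the subgraph $B(0, 1)$ is, with high probability, a tree and must look like the graph in Fig. 3.
Using the notations introduced in this figure, and neglecting normalization constants (which can
be computed at the very end), we have

$$p_{0,1}(x_0|\underline{y}_{0,1}) \propto \sum_{\{x_j\}} Q(y_0|x_0) \prod_{a \in \partial 0} \prod_{j \in \partial a \setminus 0} Q(y_j|x_j) \prod_{a \in \partial 0} I(x_0 \oplus x_{\partial a \setminus 0} = 0) . \tag{13}$$

Here we used the notation $\partial i$ ($\partial a$) to denote the set of check nodes (respectively variable nodes)
adjacent to variable node $i$ (resp. to check node $a$). Moreover, for $A = \{i_1, \dots, i_k\}$ we wrote
$x_A = x_{i_1} \oplus \cdots \oplus x_{i_k}$. Rearranging the various summations, we get the expression

$$p_{0,1}(x_0|\underline{y}_{0,1}) \propto Q(y_0|x_0) \prod_{a \in \partial 0} \sum_{x_j,\, j \in \partial a \setminus 0} I(x_0 \oplus x_{\partial a \setminus 0} = 0) \prod_{j \in \partial a \setminus 0} Q(y_j|x_j) , \tag{14}$$

which is much simpler to evaluate due to its recursive structure. In order to stress this point, we
can write the above formula as

$$p_{0,1}(x_0|\underline{y}_{0,1}) \propto Q(y_0|x_0) \prod_{a \in \partial 0} \hat{\mu}_{a \to 0}(x_0) , \tag{15}$$
$$\hat{\mu}_{a \to 0}(x_0) \propto \sum_{x_j,\, j \in \partial a \setminus 0} I(x_0 \oplus x_{\partial a \setminus 0} = 0) \prod_{j \in \partial a \setminus 0} \mu_{j \to a}(x_j) , \tag{16}$$
$$\mu_{j \to a}(x_j) \propto Q(y_j|x_j) . \tag{17}$$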
FIG. 4: Graphical representation of the belief propagation equations.

The quantities $\{\mu_{j \to a}(x_j)\}$, $\{\hat{\mu}_{a \to j}(x_j)\}$ are normalized distributions associated with the directed
edges of $G$. They are referred to as beliefs or, more generally, messages. Notice that, in the
computation of $p_{0,1}(x_0|\underline{y}_{0,1})$, only messages along edges in $B(0, 1)$, directed toward the site 0, were
relevant.

In the last form, the computation of $p_{i,r}(x_i|\underline{y}_{i,r})$ is easily generalized to any finite $r$. We first
notice that $B(i, r)$ is a tree with high probability. Therefore, we can condition on this event
without much harm. Then we associate messages to the directed edges of $B(i, r)$ (only messages
directed towards $i$ are necessary). Messages are computed according to the rules

$$\mu_{j \to a}(x_j) \propto Q(y_j|x_j) \prod_{b \in \partial j \setminus a} \hat{\mu}_{b \to j}(x_j) , \tag{18}$$
$$\hat{\mu}_{a \to j}(x_j) \propto \sum_{x_k,\, k \in \partial a \setminus j} I(x_j \oplus x_{\partial a \setminus j} = 0) \prod_{k \in \partial a \setminus j} \mu_{k \to a}(x_k) , \tag{19}$$

with boundary condition $\hat{\mu}_{b \to j}(x_j) = 1/2$ for all $b$'s outside $B(i, r)$. These equations are represented
graphically in Fig. 4. Finally, the desired marginal distribution is obtained as

$$p_{i,r}(x_i|\underline{y}_{i,r}) \propto Q(y_i|x_i) \prod_{a \in \partial i} \hat{\mu}_{a \to i}(x_i) . \tag{20}$$
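The message passing rules (18)-(20) can be tried out on a small example. The sketch below is only an illustration under stated assumptions: the Tanner graph is a hypothetical tree with 7 variable nodes and 3 checks, the channel is BSC(0.1), the received word `y` is made up, and all messages are updated in parallel starting from the uniform check-to-variable messages. On a tree, a few iterations should reproduce the exact marginals, which are computed here by brute force for comparison.

```python
from itertools import product

def bp_marginals(checks, n_var, likelihood, n_iter):
    """Iterate Eqs. (18)-(19) n_iter times, then apply Eq. (20).

    checks[a] lists the variables in check a; likelihood[j] = (Q(y_j|0), Q(y_j|1)).
    """
    edges = [(a, j) for a, members in enumerate(checks) for j in members]
    nbr = {j: [a for a, m in enumerate(checks) if j in m] for j in range(n_var)}
    mu_hat = {(a, j): (0.5, 0.5) for a, j in edges}   # check -> variable
    for _ in range(n_iter):
        mu = {}                                        # variable -> check, Eq. (18)
        for a, j in edges:
            m = list(likelihood[j])
            for b in nbr[j]:
                if b != a:
                    m = [m[x] * mu_hat[(b, j)][x] for x in (0, 1)]
            mu[(j, a)] = (m[0] / (m[0] + m[1]), m[1] / (m[0] + m[1]))
        new_hat = {}                                   # check -> variable, Eq. (19)
        for a, j in edges:
            others = [k for k in checks[a] if k != j]
            m = [0.0, 0.0]
            for xs in product((0, 1), repeat=len(others)):
                w = 1.0
                for k, xk in zip(others, xs):
                    w *= mu[(k, a)][xk]
                m[sum(xs) % 2] += w   # the parity check forces x_j = XOR of the others
            new_hat[(a, j)] = (m[0] / (m[0] + m[1]), m[1] / (m[0] + m[1]))
        mu_hat = new_hat
    out = []                                           # marginals, Eq. (20)
    for i in range(n_var):
        m = list(likelihood[i])
        for a in nbr[i]:
            m = [m[x] * mu_hat[(a, i)][x] for x in (0, 1)]
        out.append((m[0] / (m[0] + m[1]), m[1] / (m[0] + m[1])))
    return out

def exact_marginals(checks, n_var, likelihood):
    """Brute-force bitwise marginals of the posterior over codewords."""
    weight = [[0.0, 0.0] for _ in range(n_var)]
    for x in product((0, 1), repeat=n_var):
        if any(sum(x[j] for j in m) % 2 for m in checks):
            continue
        w = 1.0
        for j in range(n_var):
            w *= likelihood[j][x[j]]
        for j in range(n_var):
            weight[j][x[j]] += w
    return [(w0 / (w0 + w1), w1 / (w0 + w1)) for w0, w1 in weight]

# Hypothetical tree-like Tanner graph and received word (BSC with p = 0.1).
checks = [(0, 1, 2), (0, 3, 4), (1, 5, 6)]
p, y = 0.1, [0, 1, 0, 0, 0, 1, 0]
likelihood = [(1 - p, p) if yj == 0 else (p, 1 - p) for yj in y]

bp = bp_marginals(checks, 7, likelihood, n_iter=4)
exact = exact_marginals(checks, 7, likelihood)
max_gap = max(abs(b[0] - e[0]) for b, e in zip(bp, exact))
print(f"largest BP/exact discrepancy: {max_gap:.2e}")
```

On graphs with loops the same routine still runs, but its output is then only an approximation of the true marginals.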

Let us now forget for a moment our objective of proving an upper bound on the conditional
entropy $H(X_i|\underline{Y})$. The intuitive picture is that, as $r$ increases, the marginal $p_{i,r}(x_i|\underline{y}_{i,r})$ incorporates
information coming from a larger number of received symbols and becomes a more accurate
approximation of $p(x_i|\underline{y})$. Ideally, optimal decoding of the received message would require the
computation of $p(x_i|\underline{y})$, for which no efficient algorithm is known. In particular, the expected number
of incorrect bits is minimized by the rule $\hat{x}_i(\underline{y}) = \arg\max_{x_i} p(x_i|\underline{y})$. We can however hope that
nearly optimal performances can be obtained through the rule

$$\hat{x}_{i,r}(\underline{y}) = \arg\max_{x_i} p_{i,r}(x_i|\underline{y}_{i,r}) . \tag{21}$$

Furthermore, a moment of thought shows that the recursive procedure described above can be
implemented in parallel for all the variables $i \in [N]$. We just need to initialize $\hat{\mu}_{b \to j}(x_j) = 1/2$ for
all the check-to-variable messages, and then iterate the update equations (18), (19) at all nodes
in $G$, exactly $r$ times. For any fixed $r$, this requires just $\Theta(N)$ operations, which is probably the
smallest computational effort one can hope for.

Finally, although Eqs. (18), (19) only allow us to compute $p_{i,r}(x_i|\underline{y}_{i,r})$ as long as $B(i, r)$ is a tree, the
algorithm is well defined for any value of $r$. One can hope to improve the performances by taking
larger values of $r$.

C. A parenthesis 

The algorithm we 'discovered' in the previous Section is in fact well known under the name of 
belief propagation (BP) and is widely adopted for decoding codes on graphs. This is in turn 
an example of a wider class of algorithms which are particularly adapted to problems defined on 
sparse graphs, and are called (in a self-explanatory way) message passing algorithms. We refer
to Sec. VI for some history and bibliography.

Physicists will quickly recognize that Eqs. (18), (19) are just the equations for the Bethe-Peierls
approximation in the model at hand [...]. Unlike the original Bethe equations, because of the
quenched disorder, the solutions of these equations depend on the particular sample, and are not
'translation invariant'. These two features make Eqs. (18), (19) analogous to the Thouless, Anderson,
Palmer (TAP) equations for mean field spin glasses. In fact Eqs. (18), (19) are indeed the correct
generalization of TAP equations for diluted models^. Of course many of the classical issues in the
context of the TAP approach (such as the existence of multiple solutions, treated in the lectures of
Parisi at this School) have a direct algorithmic interpretation here.

Belief propagation was introduced in the previous paragraph as an algorithm for approximately
computing the marginals of the probability distribution

$$p(\underline{x}|\underline{y}) = \frac{1}{Z(\underline{y})} \prod_{j \in [N]} Q(y_j|x_j) \prod_{a \in [M]} I(x_{i^a_1} \oplus \cdots \oplus x_{i^a_k} = 0) , \tag{22}$$

which is 'naturally' associated to the graph $G$. It is however clear that the functions $Q(y_j|x_j)$ and
$I(x_{i^a_1} \oplus \cdots \oplus x_{i^a_k} = 0)$ do not have anything special. They could be replaced by any set of
compatibility functions $\psi_i(x_i)$, $\psi_a(x_{i^a_1}, \dots, x_{i^a_k})$. Equations (18), (19) are immediately generalized. BP
can therefore be regarded as a general inference algorithm for probabilistic graphical models.



^ In the theory of mean field disordered spin models, one speaks of diluted models whenever the number of interaction 
terms (M in the present case) scales as the size of the system. 
The success of message passing decoding has stimulated several new applications of this strategy:
let us mention a few of them. Mézard, Parisi and Zecchina [4] introduced 'survey propagation',
an algorithm which is meant to generalize BP to the case in which the underlying probability
distribution decomposes into an exponential number of pure states (replica symmetry breaking, RSB).
In agreement with the prediction of RSB for many families of random combinatorial optimization
problems, survey propagation proved extremely effective in this context.

One interesting feature of BP is its decentralized nature. One can imagine that computation 
is performed locally at variable and check nodes. This is particularly interesting for applications 
in which a large number of elements with moderate computational power must perform some 
computation collectively (as is the case in sensor networks). Van Roy and Moallemi [...] proposed
a 'consensus propagation' algorithm for accomplishing some of these tasks.

It is sometimes the case that inference must be carried out in a situation where a well established 
probabilistic model is not available. One possibility in this case is to perform 'parametric inference' 
(roughly speaking, some parameters of the model are left free). Sturmfels proposed a 'polytope
propagation' algorithm for these cases [...].



IV. DENSITY EVOLUTION A.K.A. DISTRIBUTIONAL RECURSIVE EQUATIONS 



Evaluating the upper bound (10) on the conditional entropy described in the previous Section
is essentially the same as analyzing the BP decoding algorithm defined by Eqs. (18) and (19). In
order to accomplish this task, it is convenient to notice that distributions over a binary variable
can be parametrized by a single real number. It is customary to choose this parameter to be the
log-likelihood ratio (the present definition differs by a factor 1/2 from the traditional one):

$$v_{i \to a} = \frac{1}{2} \log \frac{\mu_{i \to a}(0)}{\mu_{i \to a}(1)} , \qquad \hat{v}_{a \to i} = \frac{1}{2} \log \frac{\hat{\mu}_{a \to i}(0)}{\hat{\mu}_{a \to i}(1)} . \tag{23}$$

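We further define $h_i = \frac{1}{2} \log \frac{Q(y_i|0)}{Q(y_i|1)}$. In terms of these quantities, Eqs. (18) and (19) read

$$v^{(r+1)}_{j \to a} = h_j + \sum_{b \in \partial j \setminus a} \hat{v}^{(r)}_{b \to j} , \qquad \hat{v}^{(r)}_{a \to j} = \mathrm{atanh} \Big\{ \prod_{k \in \partial a \setminus j} \tanh v^{(r)}_{k \to a} \Big\} . \tag{24}$$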

Notice that we added an index $r \in \{0, 1, 2, \dots\}$ that can be interpreted in two equivalent ways.
On the one hand, the message $v^{(r)}_{j \to a}$ conveys information on the bit $x_j$ coming from a ('directed')
neighborhood of radius $r$. On the other, $r$ indicates the number of iterations in the BP algorithm.
As for the messages, we can encode the conditional distribution $p_{i,r}(x_i|\underline{y}_{i,r})$ through the single
number $u^{(r)}_i = \frac{1}{2} \log \frac{p_{i,r}(0|\underline{y}_{i,r})}{p_{i,r}(1|\underline{y}_{i,r})}$. Equation (20) then reads

$$u^{(r+1)}_i = h_i + \sum_{b \in \partial i} \hat{v}^{(r)}_{b \to i} . \tag{25}$$
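The equivalence between the probability form (19) of the check node update and its log-likelihood form in Eq. (24) can be verified numerically. The sketch below does so for a single check node with three incoming messages; the message values are arbitrary illustrative numbers and the function names are hypothetical. It relies on the identity $\mu(0) - \mu(1) = \tanh v$ for a distribution with half log-likelihood ratio $v$.

```python
from itertools import product
from math import atanh, log, tanh

def half_llr(m):
    """Half log-likelihood ratio of a distribution m = (m(0), m(1)), as in Eq. (23)."""
    return 0.5 * log(m[0] / m[1])

def check_update_prob(incoming):
    """Eq. (19): outgoing distribution computed from the incoming distributions."""
    out = [0.0, 0.0]
    for xs in product((0, 1), repeat=len(incoming)):
        w = 1.0
        for m, x in zip(incoming, xs):
            w *= m[x]
        out[sum(xs) % 2] += w      # outgoing bit equals the parity of the others
    z = out[0] + out[1]
    return (out[0] / z, out[1] / z)

def check_update_llr(v_in):
    """Check side of Eq. (24): v_hat = atanh(prod_k tanh v_k)."""
    prod = 1.0
    for v in v_in:
        prod *= tanh(v)
    return atanh(prod)

# Arbitrary illustrative incoming variable-to-check messages.
incoming = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4)]
v_hat_from_prob = half_llr(check_update_prob(incoming))
v_hat_from_llr = check_update_llr([half_llr(m) for m in incoming])
print(v_hat_from_prob, v_hat_from_llr)
```

The variable node side of Eq. (24) is simpler: products of distributions turn into sums of log-likelihood ratios.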

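Assume now that the graph $G$ is distributed according to the $G_N(\Lambda, P)$ ensemble and that
$i \to a$ is a uniformly random (directed) edge in this graph. It is intuitively clear (and can be
proved as well) that, for any given $r$, $v^{(r)}_{i \to a}$ will converge in distribution to some well defined random
variable $v^{(r)}$. This can be defined recursively by setting $v^{(0)} \stackrel{d}{=} h$ and, for any $r \ge 0$,

$$v^{(r+1)} \stackrel{d}{=} h + \sum_{b=1}^{l-1} \hat{v}^{(r)}_b , \qquad \hat{v}^{(r)} \stackrel{d}{=} \mathrm{atanh} \Big\{ \prod_{j=1}^{k-1} \tanh v^{(r)}_j \Big\} , \tag{26}$$

where $\{\hat{v}^{(r)}_b\}$ are i.i.d. random variables distributed as $\hat{v}^{(r)}$, and $\{v^{(r)}_j\}$ are i.i.d. random variables
distributed as $v^{(r)}$. Furthermore $l$ and $k$ are integer random variables with distributions, respectively,
$\lambda_l$, $\rho_k$ depending on the code ensemble:

$$\lambda_l = \frac{l \Lambda_l}{\sum_{l'} l' \Lambda_{l'}} , \qquad \rho_k = \frac{k P_k}{\sum_{k'} k' P_{k'}} . \tag{27}$$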

In other words, $v^{(r)}$ is the message at the root of a random tree whose offspring distributions are
given by $\lambda_l$, $\rho_k$. This is exactly the asymptotic distribution of the tree inside the ball $B(i, r)$. The
recursions (26) are known in coding theory as density evolution equations. They are indeed the
same (in the present context) as Aldous' recursive distributional equations [7], or replica symmetric
equations in spin glass theory. From Eq. (25) one easily deduces that $u^{(r)}_i$ converges in distribution
to $u^{(r)}$ defined by $u^{(r+1)} \stackrel{d}{=} h + \sum_{b=1}^{l} \hat{v}^{(r)}_b$ with $l$ distributed according to $\Lambda_l$.

At this point we can use Eq. (10) to derive a bound on the bitwise entropy $h_b$. Denote by $h(u)$
the entropy of a binary variable whose log-likelihood ratio is $u$. Explicitly

$$h(u) = -(1 + e^{-2u})^{-1} \log (1 + e^{-2u})^{-1} - (1 + e^{2u})^{-1} \log (1 + e^{2u})^{-1} . \tag{28}$$

Then we have, for any $r$,

$$\lim_{N \to \infty} E_C\, h_b \le E\, h(u^{(r)}) , \tag{29}$$

where we emphasized that the expectation on the left hand side has to be taken with respect to
the code.
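A direct transcription of the entropy kernel (28) makes its basic properties easy to inspect: it is maximal at $u = 0$, even in $u$, and decreasing in $|u|$. This is a small sketch assuming natural logarithms (the base is not fixed by the text); the function name `h` mirrors the notation of Eq. (28), and the arguments are kept moderate to avoid floating point overflow in the exponentials.

```python
from math import exp, log

def h(u):
    """Entropy (in nats) of a binary variable with half log-likelihood ratio u."""
    p0 = 1.0 / (1.0 + exp(-2.0 * u))   # probability of the value 0
    p1 = 1.0 / (1.0 + exp(2.0 * u))    # probability of the value 1
    return -sum(p * log(p) for p in (p0, p1) if p > 0.0)

values = [h(u) for u in (0.0, 0.5, 1.0, 3.0)]
print(values)
```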

It is easy to show that the right hand side of the above inequality is non-increasing with $r$. It
is therefore important to study its asymptotic behavior. As $r \to \infty$, the random variables $v^{(r)}$
converge to a limit $v^{(\infty)}$ which depends on the code ensemble as well as on the channel transition
probabilities $Q(y|x)$. Usually one is interested in a continuous family of channels indexed by a noise
parameter $p$, as for the BSC($p$), indexed in such a way that the channel 'worsens' as $p$ increases (this
notion can be made precise and is called physical degradation). For 'good' code ensembles,
the following scenario holds. For small enough $p$, $v^{(\infty)} = +\infty$ with probability one. Above some
critical value $p_{BP}$, $v^{(\infty)} < +\infty$ with non-zero probability. It can be shown that no intermediate case
is possible. In the first case BP is able to recover the transmitted codeword (apart, possibly,
from a vanishingly small fraction of bits), and the bound (29) yields $E_C\, h_b \to 0$. In the second case the
upper bound remains strictly positive in the $r \to \infty$ limit.

FIG. 5: Graphical representation of density evolution for the binary erasure channel, cf. Eq. (30).
Panels: $p = 0.35 < p_{BP}$, $p = 0.43 < p_{BP}$, $p = 0.58 < p_{BP}$, $p = 0.61 > p_{BP}$.

A particularly simple channel model, which allows one to work out in detail the behavior of density
evolution, is the binary erasure channel BEC($p$). In this case the channel output can take the
values $\{0, 1, ?\}$. The transition probabilities are $Q(0|0) = Q(1|1) = 1 - p$ and $Q(?|0) = Q(?|1) = p$.
In other words, each input is erased independently with probability $p$, and transmitted uncorrupted
otherwise. We shall further assume, for the sake of simplicity, that the random variables $l$ and $k$
in Eq. (26) are in fact deterministic. In other words, all variable nodes (parity check nodes) in
the Tanner graph have degree $l$ (degree $k$). It is not hard to realize, under the assumption that the
all-zero codeword has been transmitted, that in this case $v^{(r)}$ takes values $0$ or $+\infty$. If we denote
by $z_r$ the probability that $v^{(r)} = 0$, the density evolution equations (26) become simply

$$z_{r+1} = p \, \big( 1 - (1 - z_r)^{k-1} \big)^{l-1} . \tag{30}$$

The functions $f_p(z) = (z/p)^{1/(l-1)}$ and $g(z) = 1 - (1 - z)^{k-1}$ are plotted in Fig. 5 for $l = 4$, $k = 5$
and a few values of $p$ approaching $p_{BP}$. The recursion (30) can be described as 'bouncing back
and forth' between the curves $f_p(z)$ and $g(z)$. A little calculus shows that $z_r \to 0$ if $p < p_{BP}$ while
$z_r \to z_*(p) > 0$ for $p > p_{BP}$, where $p_{BP} \approx 0.6001110$ in the case $l = 4$, $k = 5$. A simple exercise for
the reader is to work out the upper bound on the r.h.s. of Eq. (29) for this case and to study it as
a function of $p$.
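The erasure channel recursion is simple enough to iterate numerically. The sketch below assumes the initial condition $z_0 = p$ (which follows from $v^{(0)} = h$, since $h = 0$ exactly when the bit is erased) and locates the threshold as the smallest $p$ for which $z = p\, g(z)^{l-1}$ has a solution $z > 0$, that is $p_{BP} = \min_{z \in (0,1]} z / g(z)^{l-1}$. This characterization and the grid search are choices of this sketch, and the function names are hypothetical.

```python
def density_evolution_bec(p, l, k, n_iter):
    """Iterate z_{r+1} = p * (1 - (1 - z_r)**(k-1))**(l-1), starting from z_0 = p."""
    z = p
    for _ in range(n_iter):
        z = p * (1.0 - (1.0 - z) ** (k - 1)) ** (l - 1)
    return z

def bp_threshold_bec(l, k, grid=200000):
    """Grid estimate of p_BP = min over z in (0, 1] of z / g(z)**(l-1)."""
    best = 1.0
    for n in range(1, grid + 1):
        z = n / grid
        g = 1.0 - (1.0 - z) ** (k - 1)
        best = min(best, z / g ** (l - 1))
    return best

l, k = 4, 5
p_bp = bp_threshold_bec(l, k)
z_below = density_evolution_bec(0.58, l, k, n_iter=2000)   # p < p_BP
z_above = density_evolution_bec(0.61, l, k, n_iter=2000)   # p > p_BP
print(f"p_BP ~ {p_bp:.7f}, z(0.58) = {z_below:.3e}, z(0.61) = {z_above:.4f}")
```

The two values of $p$ used here are the ones appearing in the last two panels of Fig. 5.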
V. THE AREA THEOREM AND SOME GENERAL QUESTIONS

Let us finally notice that general information theory considerations imply that
$H(\underline{X}|\underline{Y}) \le \sum_{i=1}^{N} H(X_i|\underline{Y})$. As a consequence the total entropy per bit $H(\underline{X}|\underline{Y})/N$ vanishes as well for $p < p_{BP}$.
However this inequality greatly overestimates $H(\underline{X}|\underline{Y})$: bits entering in the same parity check are,
for instance, highly correlated. How can a better estimate be obtained?

The bound in Eq. (29) can be expressed by saying that the actual entropy is no larger than
the one 'seen' by BP. Does the latter become strictly positive for $p > p_{BP}$ because of the sub-optimality of
belief propagation or because $H(\underline{X}|\underline{Y})/N$ is genuinely positive?

More generally, below $p_{BP}$ BP is essentially optimal. What happens above? A way to state
this question more precisely consists in defining a distortion $D_{BP,r}$, Eq. (31),
which measures the distance between the BP marginals $p_{i,r}(x_i|\underline{y})$ and the actual ones $p(x_i|\underline{y})$.
Below $p_{BP}$, $D_{BP,r} \to 0$ as $r \to \infty$. What happens above?

It turns out that all of these questions are closely related. We shall briefly sketch an answer
to the first one and refer to the literature for the others. However, it is worth discussing why they
are challenging, considering in particular the last one (which somehow implies the others). Both
$p(x_i|\underline{y})$ and $p_{i,r}(x_i|\underline{y})$ can be regarded as marginals of some distribution on the variables associated
to the tree $B(i, r)$. While, in the second case, this distribution has the form (12), in the first
one some complicated (and correlated) boundary condition must be added in order to take into
account the effect of the code outside $B(i, r)$. Life would be easy if the distribution of $x_i$ were
asymptotically decorrelated from the boundary condition as $r \to \infty$, for any boundary condition.
In mathematical physics terms, this happens when the infinite tree (obtained by taking the $r \to \infty$
limit after $N \to \infty$) supports a unique Gibbs measure [8]. In this case $p(x_i|\underline{y})$ and $p_{i,r}(x_i|\underline{y})$ simply
correspond to two different boundary conditions and must coincide as $r \to \infty$. Unhappily, it is easy to
convince oneself that this is never the case for good codes! For such codes no variable node of degree 0
or 1 exists, and a fixed boundary condition always determines $x_i$ uniquely (and more than one such
condition is admitted).

As promised above, we conclude by explaining how to obtain a better estimate of the conditional
entropy $H(\underline{X}|\underline{Y})$. It turns out that this also provides a tool to tackle the other questions above,
but we will not explain how. Denote by $w_i = \frac{1}{2} \log \frac{p(x_i = 0|\underline{y}_{\sim i})}{p(x_i = 1|\underline{y}_{\sim i})}$ the log-likelihood ratio which takes
into account all the information pertaining to bits $x_j$, with $j$ different from $i$, and let $w^{(r)}_i$ be the
corresponding $r$-iterations BP estimates. Finally, let $w^{(r)}$ be the weak limit of $w^{(r)}_i$ (this is given
by density evolution, in terms of $\hat{v}^{(r-1)}$). We introduce the so-called GEXIT function $g(w)$. For
the channel BSC($p$) this reads

$$g(w) = \log_2 \Big( 1 + \frac{1-p}{p}\, e^{-2w} \Big) - \log_2 \Big( 1 + \frac{p}{1-p}\, e^{-2w} \Big) . \tag{32}$$

A general definition can be found in [9]. It turns out that $E\, g(w^{(r)})$ is a decreasing function of
$r$ (in this respect, it is similar to the entropy kernel $h(u)$, cf. Eq. (28)). Remarkably, the following
area theorem holds:

$$H(\underline{X}|\underline{Y}(p_1)) - H(\underline{X}|\underline{Y}(p_0)) = \sum_{i=1}^{N} \int_{p_0}^{p_1} E\, g(w_i) \, dp , \tag{33}$$

where $\underline{Y}(p_0)$, $\underline{Y}(p_1)$ denote the outputs upon transmitting through channels with noise levels $p_0$
and $p_1$. Estimating the $w_i$'s through their BP version, fixing $p_0 = 1/2$ (we stick, for the sake of
simplicity, to the BSC($p$) case) and noticing that $H(\underline{X}|\underline{Y}(1/2)) = NR$, one gets

$$H(\underline{X}|\underline{Y}(p))/N \ge R - \int_{p}^{1/2} E\, g(w^{(r)}) \, dp . \tag{34}$$

The bound obtained by taking $r \to \infty$ on the r.h.s. is expected to be asymptotically (as $N \to \infty$)
exact for a large variety of code ensembles.
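A simple consistency check of Eq. (32) and of the area theorem is provided by uncoded transmission, which is an example chosen for this sketch and not discussed in the lectures. With no parity checks, $R = 1$, the other received symbols carry no information on $x_i$ (so $w_i = 0$), and $H(\underline{X}|\underline{Y}(p))/N$ is the binary entropy of $p$. The area formula should then reproduce the binary entropy exactly. The function names and the midpoint integration rule are hypothetical choices.

```python
from math import exp, log2

def gexit_bsc(w, p):
    """GEXIT kernel of Eq. (32) for BSC(p); w is a half log-likelihood ratio."""
    return (log2(1.0 + (1.0 - p) / p * exp(-2.0 * w))
            - log2(1.0 + p / (1.0 - p) * exp(-2.0 * w)))

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable."""
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def area(p_low, p_high, w, steps=100000):
    """Midpoint-rule approximation of the integral of g(w) over [p_low, p_high]."""
    dp = (p_high - p_low) / steps
    return sum(gexit_bsc(w, p_low + (n + 0.5) * dp) for n in range(steps)) * dp

# Uncoded case: R = 1 and w = 0 for every bit and every noise level.
p = 0.1
entropy_from_area = 1.0 - area(p, 0.5, w=0.0)
print(entropy_from_area, binary_entropy(p))
```

At $w = 0$ the kernel reduces to $\log_2 \frac{1-p}{p}$, the derivative of the binary entropy, while for $w \to +\infty$ it vanishes: a bit that is already determined by the other observations contributes nothing to the entropy change.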

VI. HISTORICAL AND BIBLIOGRAPHICAL NOTE 



Information theory and the very idea of random code ensembles were first formulated by Claude
Shannon in [...]. Random code constructions were never taken seriously from a practical point of
view until the invention of turbo codes by Claude Berrou and Alain Glavieux in 1993 [11]. This
motivated a large amount of theoretical work on sparse graph codes and iterative decoding methods.
An important step was the 're-discovery' of low density parity check codes, which were invented in
1963 by Robert Gallager [12] but soon forgotten afterwards. For an introduction to the subject and
a more comprehensive list of references see [...] as well as the upcoming book [...]. See also [...] for
a more general introduction to belief propagation with particular attention to coding applications.

The conditional entropy (or mutual information) for these systems was initially computed using
non-rigorous statistical mechanics methods [17, 18, 19], using a correspondence first found by
Nicolas Sourlas [20]. These results were later proved to provide a lower bound using Guerra's
interpolation technique [21], cf. also Francesco Guerra's lectures at this School. Finally, an
independent (rigorous) approach based on the area theorem was developed in [...], and
upper bounds were proved in particular cases in [...].